The objective is to explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.
Concretely: preprocess the raw data, analyze it, and build a linear regression model to predict the price of used cars.
The data contains attributes of used cars sold in different locations. The detailed data dictionary is given below.
Each record in the dataset describes a car and its price. Some records also carry the current price of the same car bought new.
Attribute Information (in order):
In order to understand the cars4u data, provide a model that predicts the prices of used cars, and derive business recommendations, we will follow the steps below.
# Format Python code structure
%load_ext nb_black
# Numpy & Pandas to import, process & understand data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# setting graph plots background style as darkgrid
sns.set_theme(style="darkgrid")
# to generate graphs/plots inline in jupyter notebook
%matplotlib inline
# to round numerical columns to 3 digit decimal
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# I'm setting the random seed via np.random.seed so that
# I get the same random results every time
np.random.seed(1)
# Turning off the warning SettingWithCopyWarning
pd.set_option("mode.chained_assignment", None)
In this section the goal is to load the data and then check its basic properties: size and data types.
Read the CSV file and load the data into a data frame.
# data frame names
df = None
data = None
# input file name
inputFileName = "used_cars_data.csv"
try:
    data = pd.read_csv(inputFileName, index_col="S.No.")
except FileNotFoundError as err:
    print(f"Error : cannot import file : {inputFileName}, Error : {err}")
finally:
    if not isinstance(data, pd.DataFrame):
        print("Cannot proceed further, DataFrame is null")
        # we don't want to stop the notebook here; ideally the next steps would not proceed
    else:
        print(f"input file : {inputFileName} data imported.")
        df = data.copy()
input file : used_cars_data.csv data imported.
Let's create common methods that we will use multiple times during the initial checks and data preprocessing.
# building common methods that we will use multiple times to check the data and during preprocessing
def data_shape(df):
    """
    Prints total rows and columns in the input data frame.
    """
    print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")


def print_data_types(df):
    """
    Prints data types and non-null counts for all columns in the input data frame.
    """
    print("\nData Type of each column\n")
    print(df.info())


def print_null_info(df):
    """
    Prints total null value counts for all columns in the input data frame.
    """
    print("\nTotal Null value counts\n")
    print(df.isnull().sum().sort_values(ascending=False))


def print_category_value_counts(df, column_names):
    """
    Prints unique value counts and the top 10 values for the given category columns.
    """
    for col in column_names:
        print()
        print(f"Column name : {col} has total {df[col].nunique()} unique values")
        print()
        print(df[col].value_counts()[0:10])
        print()
        print("-" * 50)
Check the first few rows of the cars data, and inspect it with its info() and describe() methods.
Check a random sample of 10 rows to see how the data looks.
df.sample(n=10)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 2397 | Ford EcoSport 1.5 Petrol Trend | Kolkata | 2016 | 21460 | Petrol | Manual | First | 17.0 kmpl | 1497 CC | 121.36 bhp | 5.000 | 9.47 Lakh | 6.000 |
| 3777 | Maruti Wagon R VXI 1.2 | Kochi | 2015 | 49818 | Petrol | Manual | First | 21.5 kmpl | 1197 CC | 81.80 bhp | 5.000 | 5.44 Lakh | 4.110 |
| 4425 | Ford Endeavour 4x2 XLT | Hyderabad | 2007 | 130000 | Diesel | Manual | First | 13.1 kmpl | 2499 CC | 141 bhp | 7.000 | NaN | 6.000 |
| 3661 | Mercedes-Benz E-Class E250 CDI Avantgrade | Coimbatore | 2016 | 39753 | Diesel | Automatic | First | 13.0 kmpl | 2143 CC | 201.1 bhp | 5.000 | NaN | 35.280 |
| 4514 | Hyundai Xcent 1.2 Kappa AT SX Option | Kochi | 2016 | 45560 | Petrol | Automatic | First | 16.9 kmpl | 1197 CC | 82 bhp | 5.000 | NaN | 6.340 |
| 599 | Toyota Innova Crysta 2.8 ZX AT | Coimbatore | 2019 | 40674 | Diesel | Automatic | First | 11.36 kmpl | 2755 CC | 171.5 bhp | 7.000 | 28.05 Lakh | 24.820 |
| 186 | Mercedes-Benz E-Class E250 CDI Avantgrade | Bangalore | 2014 | 37382 | Diesel | Automatic | First | 13.0 kmpl | 2143 CC | 201.1 bhp | 5.000 | NaN | 32.000 |
| 305 | Audi A6 2011-2015 2.0 TDI Premium Plus | Kochi | 2014 | 61726 | Diesel | Automatic | First | 17.68 kmpl | 1968 CC | 174.33 bhp | 5.000 | NaN | 20.770 |
| 4582 | Hyundai i20 1.2 Magna | Kolkata | 2011 | 36000 | Petrol | Manual | First | 18.5 kmpl | 1197 CC | 80 bhp | 5.000 | NaN | 2.500 |
| 5434 | Honda WR-V Edge Edition i-VTEC S | Kochi | 2019 | 13913 | Petrol | Manual | First | 17.5 kmpl | 1199 CC | 88.7 bhp | 5.000 | 9.36 Lakh | 8.200 |
How many rows and columns are in the input data?
# check data shape, total rows and columns
data_shape(df)
There are 7253 rows and 13 columns.
Let's check all data types and non-null value counts.
# check all data types and not null counts
print_data_types(df)
Data Type of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   object
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   object
 5   Transmission       7253 non-null   object
 6   Owner_Type         7253 non-null   object
 7   Mileage            7251 non-null   object
 8   Engine             7207 non-null   object
 9   Power              7078 non-null   object
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   object
 12  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(9)
memory usage: 793.3+ KB
None
# describe() shows a summary of all numerical columns: count, mean, std deviation, min, max, and the 25th/50th/75th percentiles
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year | 7253.000 | 2013.365 | 3.254 | 1996.000 | 2011.000 | 2014.000 | 2016.000 | 2019.000 |
| Kilometers_Driven | 7253.000 | 58699.063 | 84427.721 | 171.000 | 34000.000 | 53416.000 | 73000.000 | 6500000.000 |
| Seats | 7200.000 | 5.280 | 0.812 | 0.000 | 5.000 | 5.000 | 5.000 | 10.000 |
| Price | 6019.000 | 9.479 | 11.188 | 0.440 | 3.500 | 5.640 | 9.950 | 160.000 |
Let's check which columns have null values, and how many.
# check how many null values in each columns
print_null_info(df)
Total Null value counts

New_Price            6247
Price                1234
Power                 175
Seats                  53
Engine                 46
Mileage                 2
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
dtype: int64
Let's check for any duplicate rows.
# check for any duplicate data
df[df.duplicated()]
Let's check the value counts and the top 10 values for each category column.
Checking values for Name, Year, Location, Fuel_Type, Transmission and Owner_Type.
# print value types and value counts
print_category_value_counts(df, ["Name", "Year"])
Column name : Name has total 2041 unique values

Mahindra XUV500 W8 2WD        55
Maruti Swift VDI              49
Maruti Swift Dzire VDI        42
Honda City 1.5 S MT           39
Maruti Swift VDI BSIV         37
Maruti Ritz VDi               35
Toyota Fortuner 3.0 Diesel    35
Hyundai Grand i10 Sportz      32
Honda Brio S MT               32
Honda City 1.5 V MT           32
Name: Name, dtype: int64
--------------------------------------------------

Column name : Year has total 23 unique values

2015    929
2014    925
2016    886
2013    791
2017    709
2012    690
2011    579
2010    407
2018    361
2009    252
Name: Year, dtype: int64
--------------------------------------------------
# print value types and value counts
print_category_value_counts(df, ["Location", "Fuel_Type"])
Column name : Location has total 11 unique values

Mumbai        949
Hyderabad     876
Kochi         772
Coimbatore    772
Pune          765
Delhi         660
Kolkata       654
Chennai       591
Jaipur        499
Bangalore     440
Name: Location, dtype: int64
--------------------------------------------------

Column name : Fuel_Type has total 5 unique values

Diesel      3852
Petrol      3325
CNG           62
LPG           12
Electric       2
Name: Fuel_Type, dtype: int64
--------------------------------------------------
# print value types and value counts
print_category_value_counts(df, ["Transmission", "Owner_Type"])
Column name : Transmission has total 2 unique values

Manual       5204
Automatic    2049
Name: Transmission, dtype: int64
--------------------------------------------------

Column name : Owner_Type has total 4 unique values

First             5952
Second            1152
Third              137
Fourth & Above      12
Name: Owner_Type, dtype: int64
--------------------------------------------------
# print value types and value counts
print_category_value_counts(df, ["Year", "Engine", "Power"])
Column name : Year has total 23 unique values

2015    929
2014    925
2016    886
2013    791
2017    709
2012    690
2011    579
2010    407
2018    361
2009    252
Name: Year, dtype: int64
--------------------------------------------------

Column name : Engine has total 150 unique values

1197 CC    732
1248 CC    610
1498 CC    370
998 CC     309
1198 CC    281
2179 CC    278
1497 CC    273
1968 CC    266
1995 CC    212
1461 CC    188
Name: Engine, dtype: int64
--------------------------------------------------

Column name : Power has total 385 unique values

74 bhp       280
98.6 bhp     166
73.9 bhp     152
140 bhp      142
78.9 bhp     128
67.1 bhp     126
67.04 bhp    125
82 bhp       124
88.5 bhp     120
118 bhp      111
Name: Power, dtype: int64
--------------------------------------------------
df["Seats"].value_counts()
5.000     6047
7.000      796
8.000      170
4.000      119
6.000       38
2.000       18
10.000       8
9.000        3
0.000        1
Name: Seats, dtype: int64
# checking data types
print_data_types(df)
Data Type of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   object
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   object
 5   Transmission       7253 non-null   object
 6   Owner_Type         7253 non-null   object
 7   Mileage            7251 non-null   object
 8   Engine             7207 non-null   object
 9   Power              7078 non-null   object
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   object
 12  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(9)
memory usage: 793.3+ KB
None
# list of columns that should be converted to the category dtype
cat_vars = ["Location", "Fuel_Type", "Transmission", "Owner_Type"]
for colname in cat_vars:
    df[colname] = df[colname].astype("category")
print_data_types(df)
Data Type of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   category
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   category
 5   Transmission       7253 non-null   category
 6   Owner_Type         7253 non-null   category
 7   Mileage            7251 non-null   object
 8   Engine             7207 non-null   object
 9   Power              7078 non-null   object
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   object
 12  Price              6019 non-null   float64
dtypes: category(4), float64(2), int64(2), object(5)
memory usage: 595.9+ KB
None
# remove the unit suffixes ("kmpl" / "km/kg", "CC", "bhp") so the columns can become numeric;
# note: str.rstrip strips a *character set*, so a regex replace anchored at the end is safer
df["Mileage"] = df["Mileage"].str.replace(r"\s*(kmpl|km/kg)$", "", regex=True)
df["Engine"] = df["Engine"].str.replace(r"\s*CC$", "", regex=True)
df["Power"] = df["Power"].str.replace(r"\s*bhp$", "", regex=True)
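A caveat worth knowing here: `str.rstrip` strips a *set of characters*, not a literal suffix, which makes unit removal with it fragile. A regex `str.replace` anchored at the end of the string removes the exact unit in one step. A small sketch on toy values (not the project data):

```python
import pandas as pd

# rstrip treats its argument as a character set: the trailing "g" of
# "km/kg" is not in {" ", "k", "m", "p", "l"}, so nothing is removed there
s = pd.Series(["18.16 kmpl", "26.6 km/kg"])
print(s.str.rstrip(" kmpl").tolist())  # ['18.16', '26.6 km/kg']

# a regex anchored at the end removes either unit suffix exactly
clean = s.str.replace(r"\s*(kmpl|km/kg)$", "", regex=True).astype("float64")
print(clean.tolist())  # [18.16, 26.6]
```

This also explains why CNG cars (whose mileage is reported in km/kg) need special handling when the units are stripped one set at a time.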
# list of columns that should be converted to float
num_vars = ["Mileage", "Engine", "Power"]
for colname in num_vars:
    df[colname] = df[colname].astype("float64")
print_data_types(df)
Data Type of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   category
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   category
 5   Transmission       7253 non-null   category
 6   Owner_Type         7253 non-null   category
 7   Mileage            7251 non-null   float64
 8   Engine             7207 non-null   float64
 9   Power              7078 non-null   float64
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   object
 12  Price              6019 non-null   float64
dtypes: category(4), float64(5), int64(2), object(2)
memory usage: 595.9+ KB
None
# writing a method to convert a price string to a number (in Lakh)
def price_to_num(price_val):
    """This function takes in a string representing a price in INR and
    converts it to a number in Lakh. For example, '9.47 Lakh' becomes 9.47
    and '1.2 Cr' becomes 120 (1 Cr = 100 Lakh). If the input is already
    numeric, which means it is NaN, the function returns np.nan."""
    if isinstance(price_val, str):
        multiplier = 1  # handles Lakh vs Cr prices
        if price_val.endswith("Cr"):
            multiplier = 100
        return float(price_val.replace("Lakh", "").replace("Cr", "")) * multiplier
    else:  # this happens when the value is np.nan
        return np.nan
df["New_Price"] = df["New_Price"].apply(price_to_num)
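A quick sanity check of the Lakh/Cr logic, restated as a standalone function on toy inputs (the notebook applies the version above directly to the column):

```python
import numpy as np


def price_to_num(price_val):
    """'9.47 Lakh' -> 9.47, '2 Cr' -> 200.0 (1 Cr = 100 Lakh); NaN passes through."""
    if isinstance(price_val, str):
        multiplier = 100 if price_val.endswith("Cr") else 1
        return float(price_val.replace("Lakh", "").replace("Cr", "")) * multiplier
    return np.nan


print(price_to_num("9.47 Lakh"))  # 9.47
print(price_to_num("2 Cr"))       # 200.0
print(price_to_num(np.nan))       # nan
```

Normalising everything to Lakh keeps New_Price on the same scale as the Price target.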
# let's check how the data looks after the conversion
df.sample(n=10)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 2245 | Maruti Wagon R VXI 1.2 | Mumbai | 2015 | 55000 | Petrol | Manual | First | 21.500 | 1197.000 | 81.800 | 5.000 | 5.880 | 3.100 |
| 6407 | Maruti Swift LXI Option | Kochi | 2017 | 62889 | Petrol | Manual | First | 20.400 | 1197.000 | 81.800 | 5.000 | NaN | NaN |
| 1226 | Toyota Corolla Altis 1.8 VL CVT | Delhi | 2017 | 21000 | Petrol | Automatic | First | 14.280 | 1798.000 | 138.030 | 5.000 | 23.250 | 18.000 |
| 533 | Honda City 1.5 V AT | Mumbai | 2010 | 69000 | Petrol | Automatic | Second | 16.800 | 1497.000 | 118.000 | 5.000 | NaN | 2.850 |
| 6330 | Hyundai i20 Asta 1.2 | Kolkata | 2015 | 33000 | Petrol | Manual | First | 18.600 | 1197.000 | 81.830 | 5.000 | NaN | NaN |
| 5931 | Maruti Ertiga VXI | Mumbai | 2012 | 52500 | Petrol | Manual | Second | 16.020 | 1373.000 | 93.700 | 7.000 | NaN | 5.250 |
| 2784 | Nissan Micra Diesel XV | Hyderabad | 2011 | 86732 | Diesel | Manual | First | 23.080 | 1461.000 | 63.100 | 5.000 | NaN | 3.600 |
| 5180 | Skoda Superb Elegance 1.8 TSI AT | Bangalore | 2009 | 75000 | Petrol | Automatic | Second | 13.700 | 1798.000 | 157.750 | 5.000 | NaN | 4.950 |
| 2722 | Volkswagen Ameo 1.5 TDI Highline | Kochi | 2018 | 31426 | Diesel | Manual | First | 21.660 | 1498.000 | 108.620 | 5.000 | 9.720 | 7.900 |
| 5322 | Hyundai Grand i10 CRDi Sportz | Ahmedabad | 2016 | 30000 | Diesel | Manual | First | 24.000 | 1120.000 | 70.000 | 5.000 | NaN | 5.500 |
# check the data types
print_data_types(df)
Data Type of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   category
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   category
 5   Transmission       7253 non-null   category
 6   Owner_Type         7253 non-null   category
 7   Mileage            7251 non-null   float64
 8   Engine             7207 non-null   float64
 9   Power              7078 non-null   float64
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   float64
 12  Price              6019 non-null   float64
dtypes: category(4), float64(6), int64(2), object(1)
memory usage: 595.9+ KB
None
From the Name column we can derive the Make and Model.
# extract Make and Model from Name
model_data = pd.DataFrame(
    data=[
        (i, i.split()[0].title(), " ".join(i.split()[1:])) if i == i else (i, "", "")
        for i in df["Name"].str.strip().str.title()
    ],
    columns=["Name", "Make", "Model"],
)
# add the new columns to the existing data frame
df["Make"] = model_data["Make"]
df["Model"] = model_data["Model"]
# make Make and Model category columns
df["Make"] = df["Make"].astype("category")
df["Model"] = df["Model"].astype("category")
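The same Make/Model split can also be done without a Python-level loop using `str.split(n=1, expand=True)`; a sketch on toy names (the comprehension above additionally title-cases and guards against NaN):

```python
import pandas as pd

names = pd.Series(["Maruti Wagon R VXI 1.2", "Honda City 1.5 V MT"])
# split on the first whitespace only: column 0 is the make, column 1 the model
parts = names.str.strip().str.split(n=1, expand=True)
make, model = parts[0], parts[1]
print(make.tolist())   # ['Maruti', 'Honda']
print(model.tolist())  # ['Wagon R VXI 1.2', 'City 1.5 V MT']
```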
# print value types and value counts
print_category_value_counts(model_data, ["Make", "Model"])
Column name : Make has total 32 unique values

Maruti           1444
Hyundai          1340
Honda             743
Toyota            507
Mercedes-Benz     380
Volkswagen        374
Ford              351
Mahindra          331
Bmw               312
Audi              285
Name: Make, dtype: int64
--------------------------------------------------

Column name : Model has total 2017 unique values

Swift Dzire Vdi        62
Xuv500 W8 2Wd          55
Swift Vdi              49
City 1.5 S Mt          39
Swift Vdi Bsiv         37
Swift Dzire Vxi        37
Fortuner 3.0 Diesel    35
Ritz Vdi               35
Alto Lxi               34
Amaze S I-Dtech        32
Name: Model, dtype: int64
--------------------------------------------------
Let's derive the age of each used car as the current year minus the model year.
This should be a useful feature for predicting the price.
df["Age_Of_Car"] = 2021 - df["Year"]
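Hard-coding 2021 will silently age the feature if the notebook is rerun later; deriving the reference year from the clock keeps it current. A sketch on a toy frame (`df_demo` is hypothetical, standing in for the real df with its Year column):

```python
from datetime import date

import pandas as pd

df_demo = pd.DataFrame({"Year": [2016, 2019]})
current_year = date.today().year  # reference year taken from the clock
df_demo["Age_Of_Car"] = current_year - df_demo["Year"]
print(df_demo["Age_Of_Car"].tolist())
```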
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year | 7253.000 | 2013.365 | 3.254 | 1996.000 | 2011.000 | 2014.000 | 2016.000 | 2019.000 |
| Kilometers_Driven | 7253.000 | 58699.063 | 84427.721 | 171.000 | 34000.000 | 53416.000 | 73000.000 | 6500000.000 |
| Mileage | 7251.000 | 18.142 | 4.562 | 0.000 | 15.170 | 18.160 | 21.100 | 33.540 |
| Engine | 7207.000 | 1616.573 | 595.285 | 72.000 | 1198.000 | 1493.000 | 1968.000 | 5998.000 |
| Power | 7078.000 | 112.765 | 53.494 | 34.200 | 75.000 | 94.000 | 138.100 | 616.000 |
| Seats | 7200.000 | 5.280 | 0.812 | 0.000 | 5.000 | 5.000 | 5.000 | 10.000 |
| New_Price | 1006.000 | 22.780 | 27.759 | 3.910 | 7.885 | 11.570 | 26.043 | 375.000 |
| Price | 6019.000 | 9.479 | 11.188 | 0.440 | 3.500 | 5.640 | 9.950 | 160.000 |
| Age_Of_Car | 7253.000 | 7.635 | 3.254 | 2.000 | 5.000 | 7.000 | 10.000 | 25.000 |
Univariate analysis helps us check the spread of the data, its skewness, and possible outliers.
## building a common method to generate graphs
def generate_univariate_chart(data, feature, hue=None, kde=False, bins=20):
    """
    Builds a histplot and boxplot for the given feature.
    hue, kde and bins are optional parameters.
    """
    sns.set_style("darkgrid")
    print(f"Generating charts for feature : {feature}")
    # figsize = (width, height)
    fig, axes = plt.subplots(2, figsize=(15, 15))
    fig.suptitle("Univariate analysis for " + feature)
    sns.histplot(
        data=data,
        x=feature,
        ax=axes[0],
        palette="winter",
        bins=bins,
        kde=kde,
        hue=hue,
        multiple="dodge",
    )
    sns.boxplot(
        data=data, x=feature, ax=axes[1], showmeans=True, color="violet", hue=hue
    )
Check histogram and boxplot for data spread, skewness and outliers
# with all params
generate_univariate_chart(data=df, feature="Price", hue=None, bins=10, kde=False)
Generating charts for feature : Price
Check histogram and boxplot for data spread, skewness and outliers
# with all params
generate_univariate_chart(data=df, feature="New_Price", hue=None, bins=10, kde=False)
Generating charts for feature : New_Price
Check histogram and boxplot for data spread, skewness and outliers
# with all params
generate_univariate_chart(
data=df, feature="Kilometers_Driven", hue=None, bins=10, kde=False
)
Generating charts for feature : Kilometers_Driven
75% of the cars have been driven about 70K km or less.
The distribution is heavily right-skewed, with one extreme outlier at 6.5 million km; we should fix this row.
The outlier is a BMW X5 Xdrive 30D M Sport that is only 4 years old, so the value is almost certainly a data-entry error; let's correct it to 65,000 km.
# fix the extreme Kilometers_Driven value, using .loc to avoid chained assignment
df.loc[df["Kilometers_Driven"] >= 6500000, "Kilometers_Driven"] = 65000
Check histogram and boxplot for data spread, skewness and outliers
# with all params
generate_univariate_chart(
data=df, feature="Kilometers_Driven", hue=None, bins=10, kde=False
)
Generating charts for feature : Kilometers_Driven
# check how many null values in each columns
print_null_info(df)
Total Null value counts

New_Price            6247
Price                1234
Power                 175
Seats                  53
Engine                 46
Mileage                 2
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Make                    0
Model                   0
Age_Of_Car              0
dtype: int64
# count how many values are missing in each row
num_missing = df.isnull().sum(axis=1)
num_missing.value_counts()
1    5232
2    1125
0     823
4      36
3      27
5      10
dtype: int64
Only 823 rows (about 11-12% of the data) have no missing values, so we cannot simply drop every row with a missing value; we have to find a way to fill them in.
5232 rows have exactly one missing value (most likely New_Price).
Let's check whether we can find a pattern and fill the missing values.
df[num_missing == 4].sample(n=10)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | Make | Model | Age_Of_Car | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||||
| 4011 | Fiat Punto 1.3 Emotion | Pune | 2011 | 45271 | Diesel | Manual | First | 20.300 | NaN | NaN | NaN | NaN | 2.600 | Fiat | Punto 1.3 Emotion | 10 |
| 2074 | Maruti Swift 1.3 LXI | Pune | 2011 | 24255 | Petrol | Manual | First | 16.100 | NaN | NaN | NaN | NaN | 3.150 | Maruti | Swift 1.3 Lxi | 10 |
| 5185 | Maruti Swift 1.3 LXI | Delhi | 2012 | 52000 | Petrol | Manual | First | 16.100 | NaN | NaN | NaN | NaN | 3.650 | Maruti | Swift 1.3 Lxi | 9 |
| 2530 | BMW 5 Series 520d Sedan | Kochi | 2014 | 64158 | Diesel | Automatic | First | 18.480 | NaN | NaN | NaN | NaN | 17.890 | Bmw | 5 Series 520D Sedan | 7 |
| 1460 | Land Rover Range Rover Sport 2005 2012 Sport | Coimbatore | 2008 | 69078 | Petrol | Manual | First | 0.000 | NaN | NaN | NaN | NaN | 40.880 | Land | Rover Range Rover Sport 2005 2012 Sport | 13 |
| 4697 | Fiat Punto 1.2 Dynamic | Kochi | 2017 | 17941 | Petrol | Manual | First | 15.700 | NaN | NaN | NaN | NaN | 3.930 | Fiat | Punto 1.2 Dynamic | 4 |
| 2668 | Maruti Swift 1.3 VXi | Kolkata | 2014 | 32986 | Petrol | Manual | First | 16.100 | NaN | NaN | NaN | NaN | 4.240 | Maruti | Swift 1.3 Vxi | 7 |
| 4229 | Hyundai Santro Xing XG | Bangalore | 2005 | 79000 | Petrol | Manual | Second | 17.000 | NaN | NaN | NaN | NaN | 1.650 | Hyundai | Santro Xing Xg | 16 |
| 3810 | Honda CR-V AT With Sun Roof | Kolkata | 2013 | 27000 | Petrol | Automatic | First | 14.000 | NaN | NaN | NaN | NaN | 11.990 | Honda | Cr-V At With Sun Roof | 8 |
| 4712 | Hyundai Santro Xing XG | Pune | 2003 | 80000 | Petrol | Manual | Second | 17.000 | NaN | NaN | NaN | NaN | 0.900 | Hyundai | Santro Xing Xg | 18 |
Using Name and Year, check the existing data to see whether we can recover values for Power, Engine and Seats; the missingness is consistent across these columns.
Search the existing data for a car with the same name and year: if a match has a value, use it; otherwise fill with the median.
## find data from existing rows
def find_data_for_engine(name, year):
    """
    Find rows in the existing data with the given name and year where Engine is not null.
    """
    idx = np.where((df["Name"] == name) & (df["Year"] == year) & df["Engine"].notna())
    return idx


missing_items = df[df["Engine"].isna()][["Name", "Year"]]
print(f"Missing values before fix : {missing_items.size}")
for ind in missing_items.index:
    missing_name = missing_items["Name"][ind]
    missing_year = missing_items["Year"][ind]
    value_index = find_data_for_engine(missing_name, missing_year)
    if len(value_index[0]) > 0:
        # np.where returns positions, so read positionally and write by label
        value = df["Engine"].iloc[value_index[0][0]]
        df.loc[ind, "Engine"] = value
## fill the remainder with median values by Year and Make
df["Engine"] = df.groupby(["Year", "Make"])["Engine"].transform(
    lambda x: x.fillna(x.median())
)
## fill the remainder with median values by Make
df["Engine"] = df.groupby(["Make"])["Engine"].transform(lambda x: x.fillna(x.median()))
missing_items = df[df["Engine"].isna()][["Name", "Year"]]
print(f"Missing values after fix : {missing_items.size}")
Missing values before fix : 92
Missing values after fix : 0
## find data from existing rows
def find_data_for_power(name, year):
    """
    Find rows in the existing data with the given name and year where Power is not null.
    """
    idx = np.where((df["Name"] == name) & (df["Year"] == year) & df["Power"].notna())
    return idx


missing_items = df[df["Power"].isna()][["Name", "Year"]]
print(f"Missing values before fix : {missing_items.size}")
for ind in missing_items.index:
    missing_name = missing_items["Name"][ind]
    missing_year = missing_items["Year"][ind]
    value_index = find_data_for_power(missing_name, missing_year)
    if len(value_index[0]) > 0:
        value = df["Power"].iloc[value_index[0][0]]
        df.loc[ind, "Power"] = value
## fill the remainder with median values by Year and Make
df["Power"] = df.groupby(["Year", "Make"])["Power"].transform(
    lambda x: x.fillna(x.median())
)
## fill the remainder with median values by Make
df["Power"] = df.groupby(["Make"])["Power"].transform(lambda x: x.fillna(x.median()))
missing_items = df[df["Power"].isna()][["Name", "Year"]]
print(f"Missing values after fix : {missing_items.size}")
Missing values before fix : 350
Missing values after fix : 4
## find data from existing rows
def find_data_for_seat(name, year):
    """
    Find rows in the existing data with the given name and year where Seats is not null.
    """
    idx = np.where((df["Name"] == name) & (df["Year"] == year) & df["Seats"].notna())
    return idx


missing_items = df[df["Seats"].isna()][["Name", "Year"]]
print(f"Missing values before fix : {missing_items.size}")
for ind in missing_items.index:
    missing_name = missing_items["Name"][ind]
    missing_year = missing_items["Year"][ind]
    value_index = find_data_for_seat(missing_name, missing_year)
    if len(value_index[0]) > 0:
        value = df["Seats"].iloc[value_index[0][0]]
        df.loc[ind, "Seats"] = value
## fill the remainder with median values by Year and Make
df["Seats"] = df.groupby(["Year", "Make"])["Seats"].transform(
    lambda x: x.fillna(x.median())
)
## fill the remainder with median values by Make
df["Seats"] = df.groupby(["Make"])["Seats"].transform(lambda x: x.fillna(x.median()))
missing_items = df[df["Seats"].isna()][["Name", "Year"]]
print(f"Missing values after fix : {missing_items.size}")
Missing values before fix : 106
Missing values after fix : 0
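The Engine, Power and Seats cells above are identical apart from the column name; a single helper could replace all three. A sketch of that refactor (the helper name `fill_from_lookup_then_median` is hypothetical, and it is demonstrated on a toy frame rather than the project data):

```python
import pandas as pd


def fill_from_lookup_then_median(df, column):
    """Fill NaNs in `column`: first copy the value from another row with the
    same Name and Year, then fall back to the (Year, Make) and Make medians."""
    for ind in df.index[df[column].isna()]:
        match = df[
            (df["Name"] == df.loc[ind, "Name"])
            & (df["Year"] == df.loc[ind, "Year"])
            & df[column].notna()
        ]
        if len(match) > 0:
            df.loc[ind, column] = match[column].iloc[0]
    df[column] = df.groupby(["Year", "Make"])[column].transform(
        lambda x: x.fillna(x.median())
    )
    df[column] = df.groupby("Make")[column].transform(lambda x: x.fillna(x.median()))
    return df


demo = pd.DataFrame(
    {
        "Name": ["Swift VDI", "Swift VDI", "Alto LXI"],
        "Year": [2014, 2014, 2012],
        "Make": ["Maruti", "Maruti", "Maruti"],
        "Engine": [1248.0, None, None],
    }
)
demo = fill_from_lookup_then_median(demo, "Engine")
print(demo["Engine"].tolist())  # [1248.0, 1248.0, 1248.0]
```

Row 1 is filled by the Name+Year lookup; row 2 has no matching name, so it falls through to the Make-level median.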
# check how many null values in each columns
print_null_info(df)
Total Null value counts

New_Price            6247
Price                1234
Mileage                 2
Power                   2
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Engine                  0
Seats                   0
Make                    0
Model                   0
Age_Of_Car              0
dtype: int64
missing_items = df[df["Price"].isna()][["Name"]]
print(f"Missing values before fix : {missing_items.size}")
## fill with median values by Year, Make and Model
df["Price"] = df.groupby(["Year", "Make", "Model"])["Price"].transform(
    lambda x: x.fillna(x.median())
)
## fill with median values by Year and Make
df["Price"] = df.groupby(["Year", "Make"])["Price"].transform(
    lambda x: x.fillna(x.median())
)
missing_items = df[df["Price"].isna()][["Name"]]
print(f"Missing values after fix : {missing_items.size}")
Missing values before fix : 12
Missing values after fix : 12
missing_items = df[df["New_Price"].isna()][["Name"]]
print(f"Missing values before fix : {missing_items.size}")
## fill with median values by Year, Make and Model
df["New_Price"] = df.groupby(["Year", "Make", "Model"])["New_Price"].transform(
    lambda x: x.fillna(x.median())
)
## fill with median values by Year and Make
df["New_Price"] = df.groupby(["Year", "Make"])["New_Price"].transform(
    lambda x: x.fillna(x.median())
)
## fill with median values by Make and Model
df["New_Price"] = df.groupby(["Make", "Model"])["New_Price"].transform(
    lambda x: x.fillna(x.median())
)
missing_items = df[df["New_Price"].isna()][["Name"]]
print(f"Missing values after fix : {missing_items.size}")
Missing values before fix : 224
Missing values after fix : 224
# check how many null values in each columns
print_null_info(df)
Total Null value counts

New_Price            224
Price                 12
Mileage                2
Power                  2
Name                   0
Location               0
Year                   0
Kilometers_Driven      0
Fuel_Type              0
Transmission           0
Owner_Type             0
Engine                 0
Seats                  0
Make                   0
Model                  0
Age_Of_Car             0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   category
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   category
 5   Transmission       7253 non-null   category
 6   Owner_Type         7253 non-null   category
 7   Mileage            7251 non-null   float64
 8   Engine             7253 non-null   float64
 9   Power              7251 non-null   float64
 10  Seats              7253 non-null   float64
 11  New_Price          7029 non-null   float64
 12  Price              7241 non-null   float64
 13  Make               7253 non-null   category
 14  Model              7253 non-null   category
 15  Age_Of_Car         7253 non-null   int64
dtypes: category(6), float64(6), int64(3), object(1)
memory usage: 1013.4+ KB
# writing a method that takes a column as input and prints all values falling outside limit x IQR of the median
def check_outlier_using_IQR(column, limit=10):
    """
    Flags values more than limit x IQR away from the median and prints them.
    """
    quartiles = np.quantile(df[column][df[column].notnull()], [0.25, 0.75])
    limit_iqr = limit * (quartiles[1] - quartiles[0])
    outlier = df.loc[np.abs(df[column] - df[column].median()) > limit_iqr, column]
    print()
    print(f"Column : {column} Outlier(s) check")
    print(
        f"Mean : {df[column].mean()}, Median : {df[column].median()}, "
        f"Min : {df[column].min()}, Max : {df[column].max()}"
    )
    print(
        f"Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, "
        f"{limit}*IQR = {limit_iqr}, Total Outlier(s) : {outlier.size} \n"
    )
    print(outlier)
    print("-" * 50)
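The helper above flags points more than `limit` x IQR from the median; the more common Tukey rule instead fences at Q1 − k·IQR and Q3 + k·IQR (typically k = 1.5). A minimal numpy sketch of both rules on toy price values (not the project data):

```python
import numpy as np

values = np.array([3.5, 5.6, 9.9, 4.2, 160.0])  # toy prices with one extreme value
q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1

# median-distance rule used in the notebook (limit = 10)
median_rule = values[np.abs(values - np.median(values)) > 10 * iqr]

# classic Tukey fences (k = 1.5)
tukey_rule = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(median_rule, tukey_rule)
```

With the large `limit=10` only truly extreme points are flagged, which is why most columns below report zero outliers.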
for colname in df.columns[df.dtypes == "float64"]:  # filter float64 columns
    check_outlier_using_IQR(colname)
for colname in df.columns[df.dtypes == "int64"]:  # filter int64 columns
    check_outlier_using_IQR(colname)
Column : Mileage Outlier(s) check
Mean : 18.14158047165911, Median : 18.16, Min : 0.0, Max : 33.54
Q1 = 15.17, Q3 = 21.1, 10*IQR = 59.30000000000001, Total Outlier(s) : 0
Series([], Name: Mileage, dtype: float64)
--------------------------------------------------
Column : Engine Outlier(s) check
Mean : 1616.3930787260445, Median : 1493.0, Min : 72.0, Max : 5998.0
Q1 = 1198.0, Q3 = 1968.0, 10*IQR = 7700.0, Total Outlier(s) : 0
Series([], Name: Engine, dtype: float64)
--------------------------------------------------
Column : Power Outlier(s) check
Mean : 112.40513791201128, Median : 93.7, Min : 34.2, Max : 616.0
Q1 = 75.0, Q3 = 138.03, 10*IQR = 630.3, Total Outlier(s) : 0
Series([], Name: Power, dtype: float64)
--------------------------------------------------
Column : Seats Outlier(s) check
Mean : 5.278643320005515, Median : 5.0, Min : 0.0, Max : 10.0
Q1 = 5.0, Q3 = 5.0, 10*IQR = 0.0, Total Outlier(s) : 1158
S.No.
3 7.000
7 8.000
15 7.000
29 7.000
32 7.000
...
7175 7.000
7194 4.000
7198 7.000
7206 7.000
7223 7.000
Name: Seats, Length: 1158, dtype: float64
--------------------------------------------------
Column : New_Price Outlier(s) check
Mean : 22.029273100725643, Median : 11.15, Min : 3.91, Max : 375.0
Q1 = 8.254999999999999, Q3 = 19.27, 10*IQR = 110.15, Total Outlier(s) : 187
S.No.
4 128.000
13 139.000
14 139.000
63 171.000
126 122.977
...
7072 128.000
7157 160.000
7180 171.000
7198 139.000
Name: New_Price, Length: 187, dtype: float64
--------------------------------------------------
Column : Price Outlier(s) check
Mean : 9.35116903742574, Median : 5.63, Min : 0.44, Max : 160.0
Q1 = 3.5, Q3 = 9.87, 10*IQR = 63.69999999999999, Total Outlier(s) : 26
S.No.
282 70.360
418 70.990
655 75.000
1194 69.500
1352 70.000
1505 97.070
1583 72.940
1690 70.800
1885 79.000
1974 93.670
1984 93.000
2095 83.960
2319 78.800
2422 79.000
2541 70.150
2659 75.000
2907 75.000
3489 75.000
3752 70.660
4079 160.000
4614 70.000
4691 90.000
5535 85.000
5603 72.000
5781 120.000
5919 100.000
Name: Price, dtype: float64
--------------------------------------------------
Column : Year Outlier(s) check
Mean : 2013.3653660554253, Median : 2014.0, Min : 1996, Max : 2019
Q1 = 2011.0, Q3 = 2016.0, 10*IQR = 50.0, Total Outlier(s) : 0
Series([], Name: Year, dtype: int64)
--------------------------------------------------
Column : Kilometers_Driven Outlier(s) check
Mean : 57811.84406452502, Median : 53416.0, Min : 171, Max : 775000
Q1 = 34000.0, Q3 = 73000.0, 10*IQR = 390000.0, Total Outlier(s) : 6
S.No.
340 775000
358 620000
1860 720000
2823 480000
3092 480000
4491 445000
Name: Kilometers_Driven, dtype: int64
--------------------------------------------------
Column : Age_Of_Car Outlier(s) check
Mean : 7.6346339445746585, Median : 7.0, Min : 2, Max : 25
Q1 = 5.0, Q3 = 10.0, 10*IQR = 50.0, Total Outlier(s) : 0
Series([], Name: Age_Of_Car, dtype: int64)
--------------------------------------------------
New_Price and Price both have many values beyond the 10x IQR limit; those records are luxury cars, while most cars in the dataset are economy cars.
Models such as the Land Rover Range Rover 3.0 Diesel LWB Vogue, Lamborghini Gallardo Coupe, and Jaguar F Type 5.0 V8 S carry genuinely high prices, so these are valid values rather than outliers.
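The IQR rule applied above can be sketched as a small standalone helper (a minimal re-implementation for illustration; the notebook's `check_outlier_using_IQR` is defined in an earlier section):

```python
import pandas as pd


def iqr_outliers(s: pd.Series, limit: float = 10.0) -> pd.Series:
    """Return values farther than `limit` * IQR outside [Q1, Q3]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - limit * iqr, q3 + limit * iqr
    return s[(s < lower) | (s > upper)]


# toy example: one extreme value among otherwise tight data
s = pd.Series([10, 11, 12, 13, 14, 500])
print(iqr_outliers(s, limit=10))  # only the 500 is flagged
```

With `limit=10` the fences are very wide, which is why only truly extreme values (like the 775,000 km odometer reading) are flagged.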
cols_to_log = ["Price", "New_Price", "Kilometers_Driven"]
for colname in cols_to_log:
    plt.hist(df[colname], bins=50)
    plt.title(colname + " without log")
    plt.show()
    plt.hist(np.log(df[colname]), bins=50)
    plt.title(colname + " with log")
    plt.show()
Applying a log transform to Price, New_Price and Kilometers_Driven makes their heavily right-skewed distributions much closer to normal.
cols_to_log = ["Price", "New_Price", "Kilometers_Driven"]
for colname in cols_to_log:
    df[colname + "_log"] = np.log(df[colname])

df.drop(cols_to_log, axis=1, inplace=True)
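One way to quantify what the histograms show is to compare skewness before and after the log transform (a sketch on synthetic right-skewed data, not the project dataframe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# lognormal data mimics a right-skewed price distribution
prices = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=5000))

print("skew before log:", round(prices.skew(), 2))          # strongly right-skewed
print("skew after log :", round(np.log(prices).skew(), 2))  # close to 0
```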
# check how many null values in each columns
print_null_info(df)
Total Null value counts
New_Price_log            224
Price_log                 12
Mileage                    2
Power                      2
Name                       0
Location                   0
Year                       0
Fuel_Type                  0
Transmission               0
Owner_Type                 0
Engine                     0
Seats                      0
Make                       0
Model                      0
Age_Of_Car                 0
Kilometers_Driven_log      0
dtype: int64
# most rows don't have missing values now
num_missing = df.isnull().sum(axis=1)
num_missing.value_counts()
0    7019
1     229
2       4
3       1
dtype: int64
After checking at both the column and row level, we still have 224 rows missing New_Price, 12 missing Price, and 2 each missing Mileage and Power.
Dropping these rows will have little impact, since more than 95% of the rows have values for all columns.
Delete all rows with missing values:
df.dropna(inplace=True)
# check how many null values in each columns
print_null_info(df)
Total Null value counts
Name                     0
Location                 0
Year                     0
Fuel_Type                0
Transmission             0
Owner_Type               0
Mileage                  0
Engine                   0
Power                    0
Seats                    0
Make                     0
Model                    0
Age_Of_Car               0
Price_log                0
New_Price_log            0
Kilometers_Driven_log    0
dtype: int64
# Price Ranges
df["Price_Type"] = pd.cut(
    df["Price_log"],
    [-np.inf, 1.5, 3, np.inf],
    labels=["ECONOMY", "MID-SCALE", "LUXURY"],
)
df.sample(n=30)
| Name | Location | Year | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | Make | Model | Age_Of_Car | Price_log | New_Price_log | Kilometers_Driven_log | Price_Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||||
| 2213 | Maruti Swift VXI | Mumbai | 2012 | Petrol | Manual | First | 18.600 | 1197.000 | 85.800 | 5.000 | Maruti | Swift Vxi | 9 | 1.295 | 1.556 | 9.616 | ECONOMY |
| 681 | Volkswagen Polo 1.2 MPI Highline | Pune | 2017 | Petrol | Manual | Second | 16.200 | 1199.000 | 74.000 | 5.000 | Volkswagen | Polo 1.2 Mpi Highline | 4 | 1.792 | 2.421 | 10.897 | MID-SCALE |
| 6811 | Honda City i VTEC CVT VX | Hyderabad | 2015 | Petrol | Automatic | First | 18.000 | 1497.000 | 117.300 | 5.000 | Honda | City I Vtec Cvt Vx | 6 | 2.110 | 2.416 | 10.592 | MID-SCALE |
| 401 | Mitsubishi Pajero Sport 4X2 AT DualTone BlackTop | Hyderabad | 2018 | Diesel | Automatic | First | 13.500 | 2477.000 | 175.560 | 7.000 | Mitsubishi | Pajero Sport 4X2 At Dualtone Blacktop | 3 | 3.332 | 3.524 | 9.393 | LUXURY |
| 6207 | Hyundai Verna CRDi ABS | Chennai | 2007 | Diesel | Manual | First | 18.100 | 1493.000 | 110.000 | 5.000 | Hyundai | Verna Crdi Abs | 14 | 0.668 | 2.644 | 11.238 | ECONOMY |
| 5571 | Hyundai i20 Asta 1.2 | Bangalore | 2014 | Petrol | Manual | Second | 18.600 | 1197.000 | 81.830 | 5.000 | Hyundai | I20 Asta 1.2 | 7 | 1.732 | 2.440 | 10.343 | MID-SCALE |
| 6345 | Honda Amaze S i-Vtech | Kolkata | 2013 | Petrol | Manual | First | 18.000 | 1198.000 | 86.700 | 5.000 | Honda | Amaze S I-Vtech | 8 | 1.335 | 2.073 | 11.159 | ECONOMY |
| 4055 | Mercedes-Benz E-Class E350 CDI | Delhi | 2014 | Diesel | Automatic | First | 13.000 | 2987.000 | 261.500 | 5.000 | Mercedes-Benz | E-Class E350 Cdi | 7 | 3.314 | 4.366 | 11.127 | LUXURY |
| 2799 | Mahindra Scorpio VLX 2.2 mHawk BSIII | Delhi | 2010 | Diesel | Manual | First | 12.050 | 2179.000 | 120.000 | 8.000 | Mahindra | Scorpio Vlx 2.2 Mhawk Bsiii | 11 | 1.482 | 2.411 | 11.184 | ECONOMY |
| 529 | Toyota Fortuner 3.0 Diesel | Hyderabad | 2010 | Diesel | Manual | First | 11.500 | 2982.000 | 171.000 | 7.000 | Toyota | Fortuner 3.0 Diesel | 11 | 2.565 | 3.366 | 11.513 | MID-SCALE |
| 2510 | BMW 3 Series 320d Luxury Line | Mumbai | 2014 | Diesel | Automatic | Second | 22.690 | 1995.000 | 190.000 | 5.000 | Bmw | 3 Series 320D Luxury Line | 7 | 3.045 | 3.997 | 9.831 | LUXURY |
| 67 | Mercedes-Benz C-Class Progressive C 220d | Coimbatore | 2019 | Diesel | Automatic | First | 0.000 | 1950.000 | 194.000 | 5.000 | Mercedes-Benz | C-Class Progressive C 220D | 2 | 3.574 | 3.895 | 9.640 | LUXURY |
| 2968 | Honda Brio S Option MT | Pune | 2012 | Petrol | Manual | First | 18.400 | 1198.000 | 86.800 | 5.000 | Honda | Brio S Option Mt | 9 | 0.833 | 1.890 | 10.800 | ECONOMY |
| 1341 | Maruti Ciaz ZDi Plus SHVS | Jaipur | 2015 | Diesel | Manual | First | 28.090 | 1248.000 | 88.500 | 5.000 | Maruti | Ciaz Zdi Plus Shvs | 6 | 1.960 | 1.770 | 10.609 | MID-SCALE |
| 4048 | Honda City 1.5 V MT | Delhi | 2010 | Petrol | Manual | First | 17.000 | 1497.000 | 118.000 | 5.000 | Honda | City 1.5 V Mt | 11 | 1.253 | 2.320 | 11.294 | ECONOMY |
| 1939 | Mahindra Scorpio S6 7 Seater | Kolkata | 2015 | Diesel | Manual | Second | 15.400 | 2179.000 | 120.000 | 7.000 | Mahindra | Scorpio S6 7 Seater | 6 | 2.140 | 2.457 | 10.866 | MID-SCALE |
| 5639 | Volkswagen Polo Petrol Comfortline 1.2L | Kolkata | 2013 | Petrol | Manual | First | 16.470 | 1198.000 | 73.900 | 5.000 | Volkswagen | Polo Petrol Comfortline 1.2L | 8 | 1.082 | 2.317 | 10.086 | ECONOMY |
| 2061 | Jeep Compass 2.0 Longitude | Coimbatore | 2018 | Diesel | Manual | First | 17.100 | 1956.000 | 170.000 | 5.000 | Jeep | Compass 2.0 Longitude | 3 | 2.947 | 3.112 | 10.411 | MID-SCALE |
| 658 | Toyota Corolla Executive (HE) | Chennai | 2007 | Petrol | Manual | First | 13.400 | 1794.000 | 125.000 | 5.000 | Toyota | Corolla Executive (He) | 14 | 1.099 | 3.073 | 11.396 | ECONOMY |
| 6391 | Hyundai i10 Sportz 1.2 | Pune | 2008 | Petrol | Manual | First | 20.360 | 1197.000 | 78.900 | 5.000 | Hyundai | I10 Sportz 1.2 | 13 | 0.693 | 2.644 | 11.345 | ECONOMY |
| 4319 | Honda Brio V MT | Mumbai | 2012 | Petrol | Manual | First | 19.400 | 1198.000 | 86.800 | 5.000 | Honda | Brio V Mt | 9 | 1.082 | 1.890 | 10.897 | ECONOMY |
| 6196 | Honda Brio VX | Bangalore | 2015 | Petrol | Manual | First | 19.400 | 1198.000 | 86.800 | 5.000 | Honda | Brio Vx | 6 | 1.723 | 2.416 | 11.045 | MID-SCALE |
| 3844 | Hyundai i20 1.2 Sportz | Kolkata | 2016 | Petrol | Manual | First | 17.000 | 1197.000 | 80.000 | 5.000 | Hyundai | I20 1.2 Sportz | 5 | 1.658 | 2.655 | 9.853 | MID-SCALE |
| 4944 | Maruti Swift Dzire VDI | Pune | 2014 | Diesel | Manual | First | 23.400 | 1248.000 | 74.000 | 5.000 | Maruti | Swift Dzire Vdi | 7 | 1.609 | 1.909 | 12.191 | MID-SCALE |
| 1989 | Hyundai i20 1.4 CRDi Magna | Hyderabad | 2012 | Diesel | Manual | First | 21.900 | 1396.000 | 90.000 | 5.000 | Hyundai | I20 1.4 Crdi Magna | 9 | 1.374 | 2.617 | 11.175 | ECONOMY |
| 5739 | Maruti Swift Dzire VDi | Hyderabad | 2014 | Diesel | Manual | First | 19.300 | 1248.000 | 73.900 | 5.000 | Maruti | Swift Dzire Vdi | 7 | 1.887 | 1.909 | 11.240 | MID-SCALE |
| 253 | Honda Jazz 1.2 VX i VTEC | Coimbatore | 2017 | Petrol | Manual | First | 18.700 | 1199.000 | 88.700 | 5.000 | Honda | Jazz 1.2 Vx I Vtec | 4 | 1.970 | 2.685 | 10.811 | MID-SCALE |
| 1249 | Maruti Swift VXi BSIV | Chennai | 2014 | Petrol | Manual | Second | 16.100 | 1197.000 | 85.000 | 5.000 | Maruti | Swift Vxi Bsiv | 7 | 1.504 | 1.909 | 10.897 | MID-SCALE |
| 425 | Hyundai Creta 1.6 SX Plus Dual Tone Petrol | Coimbatore | 2017 | Petrol | Manual | First | 15.290 | 1591.000 | 121.300 | 5.000 | Hyundai | Creta 1.6 Sx Plus Dual Tone Petrol | 4 | 2.487 | 2.083 | 10.050 | MID-SCALE |
| 670 | Maruti Ciaz ZXi | Pune | 2014 | Petrol | Manual | First | 20.730 | 1373.000 | 91.100 | 5.000 | Maruti | Ciaz Zxi | 7 | 1.792 | 1.909 | 10.810 | MID-SCALE |
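The price-band binning above uses `pd.cut` with open-ended edges. A minimal standalone sketch (the thresholds 1.5 and 3 are on the log scale, i.e. roughly exp(1.5) ≈ 4.5 and exp(3) ≈ 20 in original price units):

```python
import numpy as np
import pandas as pd

log_prices = pd.Series([0.7, 1.3, 2.1, 3.3, 4.0])
# (-inf, 1.5] -> ECONOMY, (1.5, 3] -> MID-SCALE, (3, inf) -> LUXURY
bands = pd.cut(
    log_prices,
    [-np.inf, 1.5, 3, np.inf],
    labels=["ECONOMY", "MID-SCALE", "LUXURY"],
)
print(bands.tolist())
# -> ['ECONOMY', 'ECONOMY', 'MID-SCALE', 'LUXURY', 'LUXURY']
```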
Since Model and Make were derived from Name, and Age_Of_Car from Year, we can drop the Name and Year columns.
df.drop(["Name", "Year"], axis=1, inplace=True)
df.sample(n=10)
| Location | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | Make | Model | Age_Of_Car | Price_log | New_Price_log | Kilometers_Driven_log | Price_Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||
| 4921 | Mumbai | Petrol | Manual | First | 18.600 | 1197.000 | 81.830 | 5.000 | Hyundai | I20 Sportz 1.2 | 5 | 1.749 | 2.655 | 10.086 | MID-SCALE |
| 4525 | Pune | Diesel | Manual | First | 15.100 | 2179.000 | 140.000 | 7.000 | Mahindra | Xuv500 W8 2Wd | 9 | 1.975 | 2.411 | 11.462 | MID-SCALE |
| 4871 | Bangalore | Diesel | Manual | Third | 13.100 | 2499.000 | 141.000 | 7.000 | Ford | Endeavour 2.5L 4X2 | 10 | 1.939 | 2.241 | 11.170 | MID-SCALE |
| 2527 | Pune | CNG | Manual | Third | 17.000 | 1086.000 | 83.000 | 5.000 | Hyundai | Santro Xing Xo Cng | 16 | 0.182 | 2.498 | 11.513 | ECONOMY |
| 6071 | Mumbai | Diesel | Automatic | First | 11.400 | 2953.000 | 153.860 | 7.000 | Ford | Endeavour 3.0L 4X4 At | 11 | 1.675 | 2.241 | 11.562 | MID-SCALE |
| 6508 | Bangalore | Diesel | Manual | Second | 27.620 | 793.000 | 47.000 | 5.000 | Maruti | Celerio Zdi | 6 | 1.281 | 1.770 | 9.741 | ECONOMY |
| 2117 | Kochi | Petrol | Automatic | First | 21.400 | 1197.000 | 83.100 | 5.000 | Maruti | Baleno Delta Automatic | 3 | 2.066 | 2.323 | 9.984 | MID-SCALE |
| 135 | Kochi | Petrol | Automatic | First | 17.000 | 1591.000 | 121.300 | 5.000 | Hyundai | Verna Vtvt 1.6 At Sx Plus | 2 | 2.442 | 2.602 | 9.445 | MID-SCALE |
| 4432 | Kolkata | Petrol | Manual | First | 22.740 | 796.000 | 47.300 | 5.000 | Maruti | Alto 800 Vxi | 5 | 0.892 | 2.149 | 10.098 | ECONOMY |
| 420 | Kochi | Petrol | Manual | First | 18.900 | 1197.000 | 81.860 | 5.000 | Hyundai | Grand I10 Sportz Edition | 3 | 1.705 | 2.131 | 10.605 | MID-SCALE |
How does the data look after all these treatments?
# Let's look at the statistical summary of the data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Mileage | 7019.000 | 18.210 | 4.512 | 0.000 | 15.290 | 18.250 | 21.100 | 33.540 |
| Engine | 7019.000 | 1613.155 | 581.855 | 624.000 | 1198.000 | 1493.000 | 1968.000 | 5461.000 |
| Power | 7019.000 | 112.134 | 51.449 | 34.200 | 75.000 | 94.000 | 138.030 | 503.000 |
| Seats | 7019.000 | 5.279 | 0.796 | 0.000 | 5.000 | 5.000 | 5.000 | 10.000 |
| Age_Of_Car | 7019.000 | 7.532 | 3.166 | 2.000 | 5.000 | 7.000 | 9.000 | 23.000 |
| Price_log | 7019.000 | 1.840 | 0.849 | -0.821 | 1.267 | 1.749 | 2.303 | 5.075 |
| New_Price_log | 7019.000 | 2.648 | 0.828 | 1.364 | 2.111 | 2.411 | 2.959 | 5.438 |
| Kilometers_Driven_log | 7019.000 | 10.754 | 0.714 | 5.142 | 10.421 | 10.878 | 11.193 | 13.561 |
Let's generate a histogram and box plot for each numeric feature to understand how the data is spread and where outliers lie.
# we have already created a common chart method for univariate analysis
# generating univariate chart for the price column to check data spread and outlier range
generate_univariate_chart(data=df, feature="Price_log", hue=None, bins=10, kde=False)
Genrating Charts for feature : Price_log
## finding the typical price by exponentiating the mean of the logs (the geometric mean)
np.exp(df["Price_log"].mean())
6.295193239506662
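Note that exponentiating the mean of the logs yields the geometric mean of Price, not the arithmetic mean; on right-skewed data the geometric mean is lower. A quick illustration:

```python
import numpy as np

prices = np.array([2.0, 4.0, 8.0, 64.0])
arith = prices.mean()                # (2 + 4 + 8 + 64) / 4 = 19.5
geo = np.exp(np.log(prices).mean())  # (2 * 4 * 8 * 64) ** 0.25 = 8.0
print(arith, geo)  # the single large value pulls the arithmetic mean up
```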
#generating univariate chart for new price column to check data spread and outlier range
generate_univariate_chart(
data=df, feature="New_Price_log", hue=None, bins=10, kde=False
)
Genrating Charts for feature : New_Price_log
## exponentiating the log threshold of 3 back to the original price scale
print(f"Majority of new price are under {np.exp(3)}")
## outlier threshold on the original scale
print(f"new price outliers are beyond {np.exp(5)}")
Majority of new price are under 20.085536923187668 new price outliers are beyond 148.4131591025766
#generating univariate chart for age of car column to check data spread and outlier range
generate_univariate_chart(data=df, feature="Age_Of_Car", hue=None, bins=10, kde=False)
Genrating Charts for feature : Age_Of_Car
#generating univariate chart for mileage column to check data spread and outlier range
generate_univariate_chart(data=df, feature="Mileage", hue=None, bins=10, kde=False)
Genrating Charts for feature : Mileage
#generating univariate chart for engine size column to check data spread and outlier range
generate_univariate_chart(data=df, feature="Engine", hue=None, bins=10, kde=False)
Genrating Charts for feature : Engine
# generating univariate chart for power column to check data spread and outlier range
generate_univariate_chart(data=df, feature="Power", hue=None, bins=10, kde=False)
Genrating Charts for feature : Power
# generating univariate chart for seats column to check data spread and outlier range
generate_univariate_chart(data=df, feature="Seats", hue=None, bins=10, kde=False)
# keeping Seats as a numeric (float64) column
df["Seats"] = df["Seats"].astype("float64")
Genrating Charts for feature : Seats
# with all params
generate_univariate_chart(
data=df, feature="Kilometers_Driven_log", hue=None, bins=10, kde=False
)
Genrating Charts for feature : Kilometers_Driven_log
## common method to generate count plots for category type columns
def generate_count_chart(df, feature):
    """
    Builds a count plot for the given column
    """
    plt.figure(figsize=(20, 8))
    chart = sns.countplot(x=feature, data=df, order=df[feature].value_counts().index)
    chart.set_xticklabels(
        chart.get_xticklabels(),
        rotation=45,
        horizontalalignment="right",
        fontweight="light",
        fontsize="x-large",
    )
generate_count_chart(df, "Make")
generate_count_chart(df, "Price_Type")
generate_count_chart(df, "Age_Of_Car")
generate_count_chart(df, "Location")
generate_count_chart(df, "Fuel_Type")
generate_count_chart(df, "Transmission")
generate_count_chart(df, "Owner_Type")
generate_count_chart(df, "Seats")
# generate a correlation chart to understand how the numeric features are related
def generate_matrix_chart(data):
    """
    Generate a correlation heat map between all numerical features
    """
    sns.set_style("darkgrid")
    plt.figure(figsize=(12, 10))
    sns.heatmap(
        data.corr(), cmap="Spectral", linecolor="white", linewidths=1, annot=True
    )
generate_matrix_chart(df)
We have to keep these correlations in mind when building a linear model, since strongly correlated predictors introduce multicollinearity.
def generate_pairplot_chart(data, hue=None):
    """
    Generate a pair plot for all features
    """
    sns.set_style("darkgrid")
    sns.pairplot(data, hue=hue, palette="magma")
generate_pairplot_chart(df)
## this method generates a joint plot of feature x vs feature y
def generate_bivariate_chart(data, xfeature, yfeature, hue=None):
    """
    common method to generate joint plot for various columns
    hue param is optional
    """
    sns.set_style("darkgrid")
    print(f"Genrating Charts for feature x : {xfeature}, y : {yfeature}")
    sns.jointplot(
        data=data,
        x=xfeature,
        y=yfeature,
        palette="winter",
        height=10,
        kind="scatter",
        hue=hue,
    )
generate_bivariate_chart(
xfeature="Power", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Power, y : Price_log
generate_bivariate_chart(
xfeature="Engine", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Engine, y : Price_log
generate_bivariate_chart(
xfeature="New_Price_log", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : New_Price_log, y : Price_log
generate_bivariate_chart(
xfeature="Mileage", yfeature="Engine", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Mileage, y : Engine
generate_bivariate_chart(
xfeature="Mileage", yfeature="Power", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Mileage, y : Power
generate_bivariate_chart(xfeature="Engine", yfeature="Power", data=df, hue="Price_Type")
Genrating Charts for feature x : Engine, y : Power
generate_bivariate_chart(
xfeature="Power", yfeature="New_Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Power, y : New_Price_log
generate_bivariate_chart(
xfeature="Engine", yfeature="New_Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Engine, y : New_Price_log
# method to generate x vs y count plot and box plot
def generate_category_chart(data, xfeature, yfeature, hue=None):
    """
    generate a box plot and a count plot for the given x and y
    """
    sns.set_style("darkgrid")
    print(f"Genrating Charts for feature x : {xfeature}, y : {yfeature}")
    fig, axes = plt.subplots(2, figsize=(15, 20))
    fig.suptitle(xfeature + " vs " + yfeature)
    chart = sns.countplot(
        x=xfeature,
        data=data,
        ax=axes[0],
        hue=hue,
        order=data[xfeature].value_counts().index,  # use the passed-in frame, not the global df
    )
    chart.set_xticklabels(
        chart.get_xticklabels(),
        rotation=45,
        horizontalalignment="right",
        fontweight="light",
        fontsize="x-large",
    )
    sns.boxplot(x=xfeature, y=yfeature, hue=hue, data=data, ax=axes[1])
    plt.xticks(rotation=90)
generate_category_chart(
xfeature="Location", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Location, y : Price_log
generate_category_chart(
xfeature="Fuel_Type", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Fuel_Type, y : Price_log
generate_category_chart(
xfeature="Transmission", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Transmission, y : Price_log
generate_category_chart(
xfeature="Owner_Type", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Owner_Type, y : Price_log
generate_category_chart(
xfeature="Make", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Make, y : Price_log
generate_category_chart(
xfeature="Age_Of_Car", yfeature="Price_log", data=df, hue="Price_Type"
)
Genrating Charts for feature x : Age_Of_Car, y : Price_log
Let's build different linear models and compare their performance and accuracy in predicting used-car prices.
All models will be built, analyzed, and scored the same way, on the same data.
# Model is almost unique per row, so it adds no signal; drop the Model column
X_1 = df.drop(["Price_log", "Model"], axis=1)
y_1 = df["Price_log"]
X_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7019 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Location               7019 non-null   category
 1   Fuel_Type              7019 non-null   category
 2   Transmission           7019 non-null   category
 3   Owner_Type             7019 non-null   category
 4   Mileage                7019 non-null   float64
 5   Engine                 7019 non-null   float64
 6   Power                  7019 non-null   float64
 7   Seats                  7019 non-null   float64
 8   Make                   7019 non-null   category
 9   Age_Of_Car             7019 non-null   int64
 10  New_Price_log          7019 non-null   float64
 11  Kilometers_Driven_log  7019 non-null   float64
 12  Price_Type             7019 non-null   category
dtypes: category(6), float64(6), int64(1)
memory usage: 740.2 KB
Dummy variables have to be created for all categorical columns so they can be used as model features.
X_1 = pd.get_dummies(
X_1,
columns=X_1.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X_1.head()
| Mileage | Engine | Power | Seats | Age_Of_Car | New_Price_log | Kilometers_Driven_log | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_Electric | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Make_Audi | Make_Bentley | Make_Bmw | Make_Chevrolet | Make_Datsun | Make_Fiat | Make_Force | Make_Ford | Make_Hindustan | Make_Honda | Make_Hyundai | Make_Isuzu | Make_Jaguar | Make_Jeep | Make_Lamborghini | Make_Land | Make_Mahindra | Make_Maruti | Make_Mercedes-Benz | Make_Mini | Make_Mitsubishi | Make_Nissan | Make_Opelcorsa | Make_Porsche | Make_Renault | Make_Skoda | Make_Smart | Make_Tata | Make_Toyota | Make_Volkswagen | Make_Volvo | Price_Type_MID-SCALE | Price_Type_LUXURY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 0 | 26.600 | 998.000 | 58.160 | 5.000 | 11 | 1.707 | 11.184 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 19.670 | 1582.000 | 126.200 | 5.000 | 6 | 2.180 | 10.621 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 18.200 | 1199.000 | 88.700 | 5.000 | 10 | 2.153 | 10.736 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 20.770 | 1248.000 | 88.760 | 7.000 | 9 | 1.556 | 11.374 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 15.200 | 1968.000 | 140.800 | 5.000 | 8 | 4.852 | 10.613 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
X_1.shape
(7019, 58)
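Setting `drop_first=True` keeps k-1 dummy columns per categorical feature, which avoids perfect multicollinearity with the intercept. A toy illustration (a hypothetical `Fuel` column, not the project dataframe):

```python
import pandas as pd

toy = pd.DataFrame({"Fuel": ["Petrol", "Diesel", "CNG", "Petrol"]})
# categories sort as CNG, Diesel, Petrol; the first (CNG) is dropped
dummies = pd.get_dummies(toy, columns=["Fuel"], drop_first=True)
print(dummies)
```

A row of all zeros in the dummy columns then represents the dropped baseline category (CNG here).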
We have to split the data into a training set and a test set; the model will be fitted on the training set and its performance evaluated on the test set.
# splitting the data in 70:30 ratio for train to test data, Random seed = 20
x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(
X_1, y_1, test_size=0.3, random_state=20
)
print("Total rows in train :", x_train_1.shape[0])
print("Total rows in test :", x_test_1.shape[0])
## building a LinearRegression model and fitting it on the training set
# also creating a name label to track what type of model it is
lm_1 = LinearRegression()
lm_1.fit(x_train_1, y_train_1)
lm_1_name = "With all Features"
The method below computes R-squared, adjusted R-squared, and error metrics (RMSE, MAE, MSE); it can be reused across the different linear models.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
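As a sanity check, the adjusted R-squared formula can be verified end to end on synthetic data (a sketch using sklearn; the data and coefficients here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
n, k = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
# adjusted R-squared is always <= R-squared; it penalizes extra predictors
print(r2, adj_r2)
```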
# function to compute different metrics to check performance of a regression model
def model_performance_regression(
    model, predictors, target, model_type=None, run_type=None
):
    """
    Function to compute different metrics to check regression model performance
    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mse = mean_squared_error(target, pred)  # to compute MSE
    y_mean = target.mean()
    SST = ((target - y_mean) ** 2).sum()  # total sum of squares
    SSE = ((target - pred) ** 2).sum()  # sum of squared errors
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Model": model_type,
            "Run Type": run_type,
            "Total Features": len(predictors.columns),
            "Size": target.size,
            "score %": (r2 * 100),
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "RMSE": rmse,
            "MAE": mae,
            "MSE": mse,
            "SST": SST,
            "SSE": SSE,
        },
        index=[0],
    )
    return df_perf
# Checking model performance on train set
# print("Training Performance\n")
perf_training_1 = model_performance_regression(
lm_1, x_train_1, y_train_1, lm_1_name, "Training"
)
# Checking model performance on test set
# print("Test Performance\n")
perf_testing_1 = model_performance_regression(
lm_1, x_test_1, y_test_1, lm_1_name, "Test"
)
perf_training_1_result = pd.concat([perf_training_1, perf_testing_1])
perf_training_1_result
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With all Features | Training | 58 | 4913 | 94.138 | 0.941 | 0.941 | 0.204 | 0.157 | 0.042 | 3500.187 | 205.188 |
| 0 | With all Features | Test | 58 | 2106 | 93.711 | 0.937 | 0.935 | 0.216 | 0.159 | 0.047 | 1558.031 | 97.983 |
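Because the model predicts log-price, RMSE is an error on the log scale and is best read multiplicatively: a test RMSE of about 0.216 means predictions are typically within a factor of exp(0.216), i.e. roughly ±24% of the true price. A quick check:

```python
import numpy as np

rmse_log = 0.216  # test RMSE from the table above
factor = np.exp(rmse_log)
print(f"typical multiplicative error: x{factor:.2f}")
```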
# creating a new data frame for this model so that the existing data won't be impacted
df_2 = df.copy()

## deleting rows with CNG, LPG or Electric fuel types
indexNames = df_2[df_2["Fuel_Type"].isin(["CNG", "LPG", "Electric"])].index
df_2.drop(indexNames, inplace=True)

## deleting rows where seats are not 5 or 7
indexNames = df_2[(df_2["Seats"] != 5) & (df_2["Seats"] != 7)].index
df_2.drop(indexNames, inplace=True)

## deleting rows where the owner is not first or second
indexNames = df_2[(df_2["Owner_Type"] != "First") & (df_2["Owner_Type"] != "Second")].index
df_2.drop(indexNames, inplace=True)
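The three deletions above can also be expressed more compactly with pandas `isin` (a sketch on a tiny hypothetical frame; same keep/drop logic):

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "CNG", "Diesel", "LPG"],
    "Seats": [5, 5, 7, 4],
    "Owner_Type": ["First", "Second", "First", "Third"],
})
# keep only mainstream fuel types, 5/7 seats, and first/second owners
kept = df_toy[
    ~df_toy["Fuel_Type"].isin(["CNG", "LPG", "Electric"])
    & df_toy["Seats"].isin([5, 7])
    & df_toy["Owner_Type"].isin(["First", "Second"])
]
print(kept)  # only the Petrol and Diesel rows survive
```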
# Model is almost unique per row, so it adds no signal; drop the Model column
X_2 = df_2.drop(["Price_log", "Model"], axis=1)
y_2 = df_2["Price_log"]
X_2 = pd.get_dummies(
X_2,
columns=X_2.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X_2.head()
| Mileage | Engine | Power | Seats | Age_Of_Car | New_Price_log | Kilometers_Driven_log | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_Electric | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Make_Audi | Make_Bentley | Make_Bmw | Make_Chevrolet | Make_Datsun | Make_Fiat | Make_Force | Make_Ford | Make_Hindustan | Make_Honda | Make_Hyundai | Make_Isuzu | Make_Jaguar | Make_Jeep | Make_Lamborghini | Make_Land | Make_Mahindra | Make_Maruti | Make_Mercedes-Benz | Make_Mini | Make_Mitsubishi | Make_Nissan | Make_Opelcorsa | Make_Porsche | Make_Renault | Make_Skoda | Make_Smart | Make_Tata | Make_Toyota | Make_Volkswagen | Make_Volvo | Price_Type_MID-SCALE | Price_Type_LUXURY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 1 | 19.670 | 1582.000 | 126.200 | 5.000 | 6 | 2.180 | 10.621 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 18.200 | 1199.000 | 88.700 | 5.000 | 10 | 2.153 | 10.736 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 20.770 | 1248.000 | 88.760 | 7.000 | 9 | 1.556 | 11.374 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 15.200 | 1968.000 | 140.800 | 5.000 | 8 | 4.852 | 10.613 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 23.080 | 1461.000 | 63.100 | 5.000 | 8 | 2.689 | 11.374 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# splitting the data in 70:30 ratio for train to test data, Random seed = 20
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(
X_2, y_2, test_size=0.3, random_state=20
)
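The helper `model_performance_regression` used below is defined earlier in the notebook and not repeated in this section. A minimal sketch consistent with the metric columns it reports (the exact rounding and column layout are assumptions) might look like:

```python
import numpy as np
import pandas as pd


def model_performance_regression(model, X, y, model_name, run_type):
    """Return a one-row DataFrame of regression metrics (sketch)."""
    pred = model.predict(X)
    n, k = X.shape  # observations, features
    sse = float(np.sum((y - pred) ** 2))        # residual sum of squares
    sst = float(np.sum((y - np.mean(y)) ** 2))  # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = sse / n
    return pd.DataFrame(
        {
            "Model": [model_name],
            "Run Type": [run_type],
            "Total Features": [k],
            "Size": [n],
            "score %": [round(r2 * 100, 3)],
            "R-squared": [round(r2, 3)],
            "Adj. R-squared": [round(adj_r2, 3)],
            "RMSE": [round(float(np.sqrt(mse)), 3)],
            "MAE": [round(float(np.mean(np.abs(y - pred))), 3)],
            "MSE": [round(mse, 3)],
            "SST": [round(sst, 3)],
            "SSE": [round(sse, 3)],
        }
    )
```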
## Building a LinearRegression model and fitting it on the training set,
# and creating a name label to track which model variant this is
lm_2 = LinearRegression()
lm_2.fit(x_train_2, y_train_2)
lm_2_name = "With all Features, highly-sold categories"
# Checking model performance on train set
# print("Training Performance\n")
perf_training_2 = model_performance_regression(
lm_2, x_train_2, y_train_2, lm_2_name, "Training"
)
# Checking model performance on test set
# print("Test Performance\n")
perf_testing_2 = model_performance_regression(
lm_2, x_test_2, y_test_2, lm_2_name, "Test"
)
perf_training_2_result = pd.concat([perf_training_2, perf_testing_2])
perf_training_2_result
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With all Features, highly-sold categories | Training | 58 | 4562 | 94.396 | 0.944 | 0.943 | 0.198 | 0.151 | 0.039 | 3195.380 | 179.084 |
| 0 | With all Features, highly-sold categories | Test | 58 | 1956 | 93.206 | 0.932 | 0.930 | 0.211 | 0.157 | 0.044 | 1279.445 | 86.921 |
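The Adj. R-squared column adjusts $R^2$ for the number of predictors $p$ relative to the sample size $n$:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

For the training run above, $1 - (1 - 0.94396)\cdot\frac{4561}{4562 - 58 - 1} \approx 0.943$, matching the table.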
perf_training_compare = pd.concat([perf_training_1_result, perf_training_2_result])
perf_training_compare
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With all Features | Training | 58 | 4913 | 94.138 | 0.941 | 0.941 | 0.204 | 0.157 | 0.042 | 3500.187 | 205.188 |
| 0 | With all Features | Test | 58 | 2106 | 93.711 | 0.937 | 0.935 | 0.216 | 0.159 | 0.047 | 1558.031 | 97.983 |
| 0 | With all Features, highly-sold categories | Training | 58 | 4562 | 94.396 | 0.944 | 0.943 | 0.198 | 0.151 | 0.039 | 3195.380 | 179.084 |
| 0 | With all Features, highly-sold categories | Test | 58 | 1956 | 93.206 | 0.932 | 0.930 | 0.211 | 0.157 | 0.044 | 1279.445 | 86.921 |
# We don't need the Model column, so drop it (it is almost unique per row)
df_3 = df_2.copy()
X_3 = df_3.drop(["Price_log", "Model", "Power", "New_Price_log"], axis=1)
y_3 = df_3["Price_log"]
X_3 = pd.get_dummies(
X_3,
columns=X_3.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
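`pd.get_dummies(..., drop_first=True)` one-hot encodes each categorical column and drops its first level, which avoids perfect multicollinearity (the "dummy variable trap") in a linear model. A toy illustration with a hypothetical three-level categorical:

```python
import pandas as pd

# Hypothetical three-level categorical, mirroring the encoding used above
df_toy = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "CNG", "Diesel"]})
encoded = pd.get_dummies(df_toy, columns=["Fuel_Type"], drop_first=True)

# The first level in sorted order ("CNG") becomes the implicit baseline
print(list(encoded.columns))  # ['Fuel_Type_Diesel', 'Fuel_Type_Petrol']
```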
# Split the data 70:30 into train and test sets (random seed = 20)
x_train_3, x_test_3, y_train_3, y_test_3 = train_test_split(
X_3, y_3, test_size=0.3, random_state=20
)
## Building a LinearRegression model and fitting it on the training set,
# and creating a name label to track which model variant this is
lm_3 = LinearRegression()
lm_3.fit(x_train_3, y_train_3)
lm_3_name = "With Model 2 Features, No Power, No New Price"
# Checking model performance on train set
# print("Training Performance\n")
perf_training_3 = model_performance_regression(
lm_3, x_train_3, y_train_3, lm_3_name, "Training"
)
# Checking model performance on test set
# print("Test Performance\n")
perf_testing_3 = model_performance_regression(
lm_3, x_test_3, y_test_3, lm_3_name, "Test"
)
perf_training_3_result = pd.concat([perf_training_3, perf_testing_3])
perf_training_3_result
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With Model 2 Features, No Power, No New Price | Training | 56 | 4562 | 93.668 | 0.937 | 0.936 | 0.211 | 0.162 | 0.044 | 3195.380 | 202.320 |
| 0 | With Model 2 Features, No Power, No New Price | Test | 56 | 1956 | 92.829 | 0.928 | 0.926 | 0.217 | 0.166 | 0.047 | 1279.445 | 91.754 |
perf_training_compare = pd.concat(
[perf_training_1_result, perf_training_2_result, perf_training_3_result]
)
perf_training_compare
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With all Features | Training | 58 | 4913 | 94.138 | 0.941 | 0.941 | 0.204 | 0.157 | 0.042 | 3500.187 | 205.188 |
| 0 | With all Features | Test | 58 | 2106 | 93.711 | 0.937 | 0.935 | 0.216 | 0.159 | 0.047 | 1558.031 | 97.983 |
| 0 | With all Features, highly-sold categories | Training | 58 | 4562 | 94.396 | 0.944 | 0.943 | 0.198 | 0.151 | 0.039 | 3195.380 | 179.084 |
| 0 | With all Features, highly-sold categories | Test | 58 | 1956 | 93.206 | 0.932 | 0.930 | 0.211 | 0.157 | 0.044 | 1279.445 | 86.921 |
| 0 | With Model 2 Features, No Power, No New Price | Training | 56 | 4562 | 93.668 | 0.937 | 0.936 | 0.211 | 0.162 | 0.044 | 3195.380 | 202.320 |
| 0 | With Model 2 Features, No Power, No New Price | Test | 56 | 1956 | 92.829 | 0.928 | 0.926 | 0.217 | 0.166 | 0.047 | 1279.445 | 91.754 |
# correlation matrix of the numeric columns (symmetric, so the transpose is cosmetic)
df.corr().T
| Mileage | Engine | Power | Seats | Age_Of_Car | Price_log | New_Price_log | Kilometers_Driven_log | |
|---|---|---|---|---|---|---|---|---|
| Mileage | 1.000 | -0.593 | -0.503 | -0.321 | -0.305 | -0.261 | -0.422 | -0.134 |
| Engine | -0.593 | 1.000 | 0.852 | 0.433 | 0.048 | 0.678 | 0.657 | 0.160 |
| Power | -0.503 | 0.852 | 1.000 | 0.123 | -0.025 | 0.760 | 0.765 | 0.022 |
| Seats | -0.321 | 0.433 | 0.123 | 1.000 | -0.014 | 0.181 | 0.016 | 0.185 |
| Age_Of_Car | -0.305 | 0.048 | -0.025 | -0.014 | 1.000 | -0.505 | -0.008 | 0.507 |
| Price_log | -0.261 | 0.678 | 0.760 | 0.181 | -0.505 | 1.000 | 0.706 | -0.201 |
| New_Price_log | -0.422 | 0.657 | 0.765 | 0.016 | -0.008 | 0.706 | 1.000 | 0.016 |
| Kilometers_Driven_log | -0.134 | 0.160 | 0.022 | 0.185 | 0.507 | -0.201 | 0.016 | 1.000 |
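To turn this matrix into a feature-selection rule, one option (a sketch; the 0.5 cutoff is an assumption, not a value used in the notebook) is to keep features whose absolute correlation with the target exceeds a threshold:

```python
import pandas as pd


def strongly_correlated(corr: pd.DataFrame, target: str, threshold: float = 0.5):
    """Return features whose |correlation| with `target` exceeds `threshold`."""
    s = corr[target].drop(labels=[target]).abs()
    return s[s > threshold].index.tolist()
```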
# We don't need the Model column, so drop it (it is almost unique per row)
df_4 = df.copy()
X_4 = df_4.drop(
["Price_log", "Model", "Mileage", "Seats", "Kilometers_Driven_log"], axis=1
)
y_4 = df_4["Price_log"]
X_4 = pd.get_dummies(
X_4,
columns=X_4.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
# Split the data 70:30 into train and test sets (random seed = 20)
x_train_4, x_test_4, y_train_4, y_test_4 = train_test_split(
X_4, y_4, test_size=0.3, random_state=20
)
## Building a LinearRegression model and fitting it on the training set,
# and creating a name label to track which model variant this is
lm_4 = LinearRegression()
lm_4.fit(x_train_4, y_train_4)
lm_4_name = "With Highly Correlated Features"
# Checking model performance on train set
# print("Training Performance\n")
perf_training_4 = model_performance_regression(
lm_4, x_train_4, y_train_4, lm_4_name, "Training"
)
# Checking model performance on test set
# print("Test Performance\n")
perf_testing_4 = model_performance_regression(
lm_4, x_test_4, y_test_4, lm_4_name, "Test"
)
perf_training_4_result = pd.concat([perf_training_4, perf_testing_4])
perf_training_4_result
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With Highly Correlated Features | Training | 55 | 4913 | 93.980 | 0.940 | 0.939 | 0.207 | 0.159 | 0.043 | 3500.187 | 210.698 |
| 0 | With Highly Correlated Features | Test | 55 | 2106 | 93.550 | 0.936 | 0.934 | 0.218 | 0.161 | 0.048 | 1558.031 | 100.489 |
perf_training_compare = pd.concat(
[
perf_training_1_result,
perf_training_2_result,
perf_training_3_result,
perf_training_4_result,
]
)
perf_training_compare
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With all Features | Training | 58 | 4913 | 94.138 | 0.941 | 0.941 | 0.204 | 0.157 | 0.042 | 3500.187 | 205.188 |
| 0 | With all Features | Test | 58 | 2106 | 93.711 | 0.937 | 0.935 | 0.216 | 0.159 | 0.047 | 1558.031 | 97.983 |
| 0 | With all Features, highly-sold categories | Training | 58 | 4562 | 94.396 | 0.944 | 0.943 | 0.198 | 0.151 | 0.039 | 3195.380 | 179.084 |
| 0 | With all Features, highly-sold categories | Test | 58 | 1956 | 93.206 | 0.932 | 0.930 | 0.211 | 0.157 | 0.044 | 1279.445 | 86.921 |
| 0 | With Model 2 Features, No Power, No New Price | Training | 56 | 4562 | 93.668 | 0.937 | 0.936 | 0.211 | 0.162 | 0.044 | 3195.380 | 202.320 |
| 0 | With Model 2 Features, No Power, No New Price | Test | 56 | 1956 | 92.829 | 0.928 | 0.926 | 0.217 | 0.166 | 0.047 | 1279.445 | 91.754 |
| 0 | With Highly Correlated Features | Training | 55 | 4913 | 93.980 | 0.940 | 0.939 | 0.207 | 0.159 | 0.043 | 3500.187 | 210.698 |
| 0 | With Highly Correlated Features | Test | 55 | 2106 | 93.550 | 0.936 | 0.934 | 0.218 | 0.161 | 0.048 | 1558.031 | 100.489 |
**Forward feature selection starts with an empty model and adds variables one by one.**
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
# prepare data
# We don't need the Model column, so drop it (it is almost unique per row)
df_5 = df.copy()
X_5 = df_5.drop(["Price_log", "Model"], axis=1)
y_5 = df_5["Price_log"]
X_5 = pd.get_dummies(
X_5,
columns=X_5.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
# Split the data 70:30 into train and test sets (random seed = 20)
x_train_5, x_test_5, y_train_5, y_test_5 = train_test_split(
X_5, y_5, test_size=0.3, random_state=20
)
reg = LinearRegression()
# Build step forward feature selection
sfs = SFS(
    reg,
    k_features=x_train_5.shape[1],  # number of features to select (here: all of them)
    forward=True,
    floating=False,
    n_jobs=-1,
    scoring="r2",
    cv=5,
)
# Perform SFFS
sfs = sfs.fit(x_train_5, y_train_5)
# to plot the performance with addition of each feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
fig1 = plot_sfs(sfs.get_metric_dict(), kind="std_err", figsize=(15, 5))
plt.title("Sequential Forward Selection (w. StdErr)")
plt.xticks(rotation=90)
plt.show()
reg = LinearRegression()
# Build step forward feature selection
sfs = SFS(
reg, k_features=28, forward=True, floating=False, scoring="r2", n_jobs=-1, cv=5,
)
# Perform SFFS
sfs = sfs.fit(x_train_5, y_train_5)
# let us select the features which are important
feat_cols = list(sfs.k_feature_idx_)
print(feat_cols)
[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14, 17, 21, 25, 27, 37, 38, 40, 42, 43, 44, 45, 50, 52, 53, 55, 56, 57]
# let us look at the names of the important features
x_train_5.columns[feat_cols]
Index(['Mileage', 'Engine', 'Power', 'Seats', 'Age_Of_Car', 'New_Price_log',
'Kilometers_Driven_log', 'Location_Bangalore', 'Location_Coimbatore',
'Location_Hyderabad', 'Location_Kolkata', 'Fuel_Type_Diesel',
'Transmission_Manual', 'Make_Audi', 'Make_Bmw', 'Make_Jaguar',
'Make_Jeep', 'Make_Land', 'Make_Maruti', 'Make_Mercedes-Benz',
'Make_Mini', 'Make_Mitsubishi', 'Make_Skoda', 'Make_Tata',
'Make_Toyota', 'Make_Volvo', 'Price_Type_MID-SCALE',
'Price_Type_LUXURY'],
dtype='object')
x_train_final = x_train_5[x_train_5.columns[feat_cols]]
# Creating new x_test with the same variables that we selected for x_train
x_test_final = x_test_5[x_train_final.columns]
# Fitting linear model
lin_reg_model2 = LinearRegression()
lin_reg_model2.fit(x_train_final, y_train_5)
LinearRegression()
# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_train_perf = model_performance_regression(
lin_reg_model2, x_train_final, y_train_5, "Top 28 Features", "Train"
)
lin_reg_model_train_perf
Training Performance
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Top 28 Features | Train | 28 | 4913 | 93.967 | 0.940 | 0.939 | 0.207 | 0.159 | 0.043 | 3500.187 | 211.176 |
# Checking model performance on test set
print("Test Performance\n")
lin_reg_model_test_perf = model_performance_regression(
lin_reg_model2, x_test_final, y_test_5, "Top 28 Features", "Test"
)
lin_reg_model_test_perf
Test Performance
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Top 28 Features | Test | 28 | 2106 | 93.562 | 0.936 | 0.935 | 0.218 | 0.161 | 0.048 | 1558.031 | 100.304 |
perf_training_compare = pd.concat(
[
perf_training_1_result,
perf_training_2_result,
perf_training_3_result,
perf_training_4_result,
lin_reg_model_train_perf,
lin_reg_model_test_perf,
]
)
perf_training_compare
| Model | Run Type | Total Features | Size | score % | R-squared | Adj. R-squared | RMSE | MAE | MSE | SST | SSE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | With all Features | Training | 58 | 4913 | 94.138 | 0.941 | 0.941 | 0.204 | 0.157 | 0.042 | 3500.187 | 205.188 |
| 0 | With all Features | Test | 58 | 2106 | 93.711 | 0.937 | 0.935 | 0.216 | 0.159 | 0.047 | 1558.031 | 97.983 |
| 0 | With all Features, highly-sold categories | Training | 58 | 4562 | 94.396 | 0.944 | 0.943 | 0.198 | 0.151 | 0.039 | 3195.380 | 179.084 |
| 0 | With all Features, highly-sold categories | Test | 58 | 1956 | 93.206 | 0.932 | 0.930 | 0.211 | 0.157 | 0.044 | 1279.445 | 86.921 |
| 0 | With Model 2 Features, No Power, No New Price | Training | 56 | 4562 | 93.668 | 0.937 | 0.936 | 0.211 | 0.162 | 0.044 | 3195.380 | 202.320 |
| 0 | With Model 2 Features, No Power, No New Price | Test | 56 | 1956 | 92.829 | 0.928 | 0.926 | 0.217 | 0.166 | 0.047 | 1279.445 | 91.754 |
| 0 | With Highly Correlated Features | Training | 55 | 4913 | 93.980 | 0.940 | 0.939 | 0.207 | 0.159 | 0.043 | 3500.187 | 210.698 |
| 0 | With Highly Correlated Features | Test | 55 | 2106 | 93.550 | 0.936 | 0.934 | 0.218 | 0.161 | 0.048 | 1558.031 | 100.489 |
| 0 | Top 28 Features | Train | 28 | 4913 | 93.967 | 0.940 | 0.939 | 0.207 | 0.159 | 0.043 | 3500.187 | 211.176 |
| 0 | Top 28 Features | Test | 28 | 2106 | 93.562 | 0.936 | 0.935 | 0.218 | 0.161 | 0.048 | 1558.031 | 100.304 |
Create a scatterplot of the real test values versus the predicted values.
predictions = lm_2.predict(x_test_2)
plt.figure(figsize=(12, 10))
plt.scatter(y_test_2, predictions)
plt.xlabel("Y Test")
plt.ylabel("Predicted Y")
Text(0, 0.5, 'Predicted Y')
plt.figure(figsize=(12, 10))
sns.histplot((y_test_2 - predictions), bins=10, kde=True)
<AxesSubplot:xlabel='Price_log', ylabel='Count'>
# let's check the coefficients and intercept of the model
coef_df = pd.DataFrame(
np.append(lm_2.intercept_, lm_2.coef_),
index=["Intercept"] + x_train_2.columns.tolist(),
columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| Intercept | 1.968 |
| Mileage | -0.009 |
| Engine | 0.000 |
| Power | 0.004 |
| Seats | 0.041 |
| Age_Of_Car | -0.096 |
| New_Price_log | 0.026 |
| Kilometers_Driven_log | -0.026 |
| Location_Bangalore | 0.064 |
| Location_Chennai | 0.011 |
| Location_Coimbatore | 0.034 |
| Location_Delhi | -0.050 |
| Location_Hyderabad | 0.045 |
| Location_Jaipur | -0.048 |
| Location_Kochi | -0.060 |
| Location_Kolkata | -0.150 |
| Location_Mumbai | -0.047 |
| Location_Pune | -0.043 |
| Fuel_Type_Diesel | 0.086 |
| Fuel_Type_Electric | -0.000 |
| Fuel_Type_LPG | -0.000 |
| Fuel_Type_Petrol | -0.086 |
| Transmission_Manual | -0.094 |
| Owner_Type_Fourth & Above | 0.000 |
| Owner_Type_Second | -0.045 |
| Owner_Type_Third | 0.000 |
| Make_Audi | 0.303 |
| Make_Bentley | -0.000 |
| Make_Bmw | 0.301 |
| Make_Chevrolet | -0.000 |
| Make_Datsun | -0.376 |
| Make_Fiat | -0.294 |
| Make_Force | -0.000 |
| Make_Ford | -0.168 |
| Make_Hindustan | 0.000 |
| Make_Honda | -0.107 |
| Make_Hyundai | -0.128 |
| Make_Isuzu | -0.333 |
| Make_Jaguar | 0.299 |
| Make_Jeep | 0.156 |
| Make_Lamborghini | -0.000 |
| Make_Land | 0.501 |
| Make_Mahindra | -0.165 |
| Make_Maruti | -0.108 |
| Make_Mercedes-Benz | 0.314 |
| Make_Mini | 0.540 |
| Make_Mitsubishi | -0.011 |
| Make_Nissan | -0.110 |
| Make_Opelcorsa | 0.000 |
| Make_Porsche | 0.000 |
| Make_Renault | -0.176 |
| Make_Skoda | -0.063 |
| Make_Smart | 0.000 |
| Make_Tata | -0.486 |
| Make_Toyota | 0.073 |
| Make_Volkswagen | -0.145 |
| Make_Volvo | 0.181 |
| Price_Type_MID-SCALE | 0.342 |
| Price_Type_LUXURY | 0.689 |
## Key features associated with higher prices
coef_df[coef_df > 0.2].dropna(axis=0)
| Coefficients | |
|---|---|
| Intercept | 1.968 |
| Make_Audi | 0.303 |
| Make_Bmw | 0.301 |
| Make_Jaguar | 0.299 |
| Make_Land | 0.501 |
| Make_Mercedes-Benz | 0.314 |
| Make_Mini | 0.540 |
| Price_Type_MID-SCALE | 0.342 |
| Price_Type_LUXURY | 0.689 |
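Since the target is `Price_log` (assumed here to be the natural log of price), a coefficient $b$ on a dummy feature corresponds to roughly a $(e^b - 1)\times 100\%$ price difference, other features held fixed:

```python
import math


def pct_effect(coef: float) -> float:
    """Approximate % change in price per one-unit change in a feature,
    assuming the target is the natural log of price."""
    return (math.exp(coef) - 1) * 100


# Make_Mini (coefficient 0.540) implies roughly +71.6% on price;
# Make_Tata (coefficient -0.486) implies roughly -38.5%
print(round(pct_effect(0.540), 1), round(pct_effect(-0.486), 1))
```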
## Key features associated with lower prices
coef_df[coef_df < -0.05].dropna(axis=0)
| Coefficients | |
|---|---|
| Age_Of_Car | -0.096 |
| Location_Delhi | -0.050 |
| Location_Kochi | -0.060 |
| Location_Kolkata | -0.150 |
| Fuel_Type_Petrol | -0.086 |
| Transmission_Manual | -0.094 |
| Make_Datsun | -0.376 |
| Make_Fiat | -0.294 |
| Make_Ford | -0.168 |
| Make_Honda | -0.107 |
| Make_Hyundai | -0.128 |
| Make_Isuzu | -0.333 |
| Make_Mahindra | -0.165 |
| Make_Maruti | -0.108 |
| Make_Nissan | -0.110 |
| Make_Renault | -0.176 |
| Make_Skoda | -0.063 |
| Make_Tata | -0.486 |
| Make_Volkswagen | -0.145 |
Automating the equation of the fitted model
to calculate the price of any future used car
# Let us write out the equation of the fitted linear regression
# (the target is Price_log, so "Price" below is on the log scale)
Equation = "Price = " + str(lm_2.intercept_)
print(Equation, end=" ")
for i in range(len(x_train_2.columns)):
if i != len(x_train_2.columns) - 1:
print(
"\n+ (", lm_2.coef_[i], ")*(", x_train_2.columns[i], ")", end=" ",
)
else:
print("\n+ (", lm_2.coef_[i], ")*(", x_train_2.columns[i], ")")
Price = 1.9675703833043392 + ( -0.0091536066246779 )*( Mileage ) + ( 8.800011588209187e-05 )*( Engine ) + ( 0.0044522015922807134 )*( Power ) + ( 0.04091854623113389 )*( Seats ) + ( -0.09553360817372422 )*( Age_Of_Car ) + ( 0.026060011968879804 )*( New_Price_log ) + ( -0.025530875779606402 )*( Kilometers_Driven_log ) + ( 0.0641071933485399 )*( Location_Bangalore ) + ( 0.011103849608016664 )*( Location_Chennai ) + ( 0.03381101500736915 )*( Location_Coimbatore ) + ( -0.050207434511672436 )*( Location_Delhi ) + ( 0.04457203122566265 )*( Location_Hyderabad ) + ( -0.047792866099684174 )*( Location_Jaipur ) + ( -0.06029914532680927 )*( Location_Kochi ) + ( -0.150174336066836 )*( Location_Kolkata ) + ( -0.04693452105273331 )*( Location_Mumbai ) + ( -0.04327649979405802 )*( Location_Pune ) + ( 0.08616622370325164 )*( Fuel_Type_Diesel ) + ( -5.221517662690189e-16 )*( Fuel_Type_Electric ) + ( -6.661338147750939e-16 )*( Fuel_Type_LPG ) + ( -0.08616622370325062 )*( Fuel_Type_Petrol ) + ( -0.09432156411944527 )*( Transmission_Manual ) + ( 2.3592239273284576e-16 )*( Owner_Type_Fourth & Above ) + ( -0.04528670665858933 )*( Owner_Type_Second ) + ( 1.1102230246251565e-16 )*( Owner_Type_Third ) + ( 0.3031777538724239 )*( Make_Audi ) + ( -1.3877787807814457e-15 )*( Make_Bentley ) + ( 0.3011491076838147 )*( Make_Bmw ) + ( -6.661338147750939e-16 )*( Make_Chevrolet ) + ( -0.3755423327481152 )*( Make_Datsun ) + ( -0.2936543168648398 )*( Make_Fiat ) + ( -6.38378239159465e-16 )*( Make_Force ) + ( -0.1682854204695558 )*( Make_Ford ) + ( 2.220446049250313e-16 )*( Make_Hindustan ) + ( -0.10704863831941518 )*( Make_Honda ) + ( -0.12806040598712873 )*( Make_Hyundai ) + ( -0.33327382978224224 )*( Make_Isuzu ) + ( 0.298837974467263 )*( Make_Jaguar ) + ( 0.1563055842979586 )*( Make_Jeep ) + ( -1.6653345369377348e-16 )*( Make_Lamborghini ) + ( 0.5013595084932326 )*( Make_Land ) + ( -0.16479031269514732 )*( Make_Mahindra ) + ( -0.10759411549332468 )*( Make_Maruti ) + ( 0.3143875839001142 )*( 
Make_Mercedes-Benz ) + ( 0.5404905248544429 )*( Make_Mini ) + ( -0.010720315120660935 )*( Make_Mitsubishi ) + ( -0.1104368003362399 )*( Make_Nissan ) + ( 0.0 )*( Make_Opelcorsa ) + ( 0.0 )*( Make_Porsche ) + ( -0.1762579116039613 )*( Make_Renault ) + ( -0.06270031845288808 )*( Make_Skoda ) + ( 0.0 )*( Make_Smart ) + ( -0.48580075319889493 )*( Make_Tata ) + ( 0.07313438157354718 )*( Make_Toyota ) + ( -0.14535156484686274 )*( Make_Volkswagen ) + ( 0.18067461677643792 )*( Make_Volvo ) + ( 0.3422128551096242 )*( Price_Type_MID-SCALE ) + ( 0.6888816750718074 )*( Price_Type_LUXURY )
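The loop above can be condensed with a join; this sketch (generic helper name, illustrative coefficients) builds the same kind of equation string:

```python
def regression_equation(intercept, coefs, names, target="Price"):
    """Assemble a readable equation string from a fitted linear model's terms."""
    terms = [f"({c:.4f})*({n})" for c, n in zip(coefs, names)]
    return f"{target} = {intercept:.4f} + " + " + ".join(terms)


# e.g. with the first two fitted terms from above
print(regression_equation(1.9676, [-0.0092, 0.0001], ["Mileage", "Engine"]))
```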